AmyloGram: Analysis of proteins in R

Jarek Chilimoniuk

Department of Bioinformatics and Genomics, University of Wrocław

Proteins

Amino acids

Proteins

A protein's higher-order structure determines its function.

Human proteome

1937 human proteins have an unknown role (the dark proteome) (Paik et al. 2018).

Goal

Development of methods for predicting protein properties on the basis of their primary structure in a way that is understandable for biologists and experimentally validated.

n-grams and reduced alphabets

n-grams (k-tuples, k-mers):

  • subsequences (continuous or discontinuous) of n amino acid or nucleotide residues,
  • more informative than the individual residues

Peptide I: FKVWPDHGSG

Peptide II: YMCIYRAQTN

n-gram examples from peptide I and II:

  • 1-gram: F, Y, K, M,
  • 2-gram: FK, YM, KV, MC,
  • 2-gram (discontinuous): F-V, Y-C, K-W, M-I,
  • 3-gram (discontinuous): F--WP, Y--IY, K--PD, M--YR.

Longer n-grams are more informative, but create larger attribute spaces that are more difficult to analyze.
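The n-gram examples above can be reproduced with a few lines of base R. This is a minimal sketch: `extract_ngrams` and its `gaps` argument are hypothetical names introduced for illustration and are not taken from any package.

```r
# Extract n-grams from a peptide. `gaps` gives the number of skipped
# residues between consecutive positions (all zeros = continuous n-gram).
# `extract_ngrams` is a hypothetical helper, not part of any package.
extract_ngrams <- function(peptide, n, gaps = rep(0, n - 1)) {
  residues <- strsplit(peptide, "")[[1]]
  # Offsets of the n-gram positions relative to its first residue.
  offsets <- c(0, cumsum(gaps + 1))
  n_starts <- length(residues) - max(offsets)
  if (n_starts < 1) return(character(0))
  vapply(seq_len(n_starts), function(start) {
    picked <- residues[start + offsets]
    # Join the residues, marking each skipped position with "-".
    paste0(picked, c(strrep("-", gaps), ""), collapse = "")
  }, character(1))
}

extract_ngrams("FKVWPDHGSG", 2)[1:2]                  # "FK" "KV"
extract_ngrams("FKVWPDHGSG", 2, gaps = 1)[1:2]        # "F-V" "K-W"
extract_ngrams("FKVWPDHGSG", 3, gaps = c(2, 0))[1:2]  # "F--WP" "K--PD"

# Number of possible continuous n-grams over 20 amino acids, n = 1..3.
20^(1:3)
```

The last line shows why longer n-grams are harder to analyze: the attribute space grows as 20^n even before gaps are considered.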

slam: Sparse Lightweight Arrays and Matrices

Counting n-grams produces large, sparse matrices, which become a memory problem when stored in dense form.

Number of sparse matrices   Package   File size [Mb]
1                           base      0.000214
1                           slam      0.001122
10                          base      0.000969
10                          slam      0.001312
100                         base      0.0765
100                         slam      0.002625
1000                        base      7.629601
1000                        slam      0.016357
10000                       base      762.939659
10000                       slam      0.153687
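The difference between dense and sparse storage can be illustrated without any extra package. The sketch below does not call slam. It stores only the positions and values of non-zero cells in a plain list, which mirrors the (i, j, v) triplet idea behind slam's `simple_triplet_matrix`. The matrix dimensions and counts are made up.

```r
# Dense count matrix: 200 peptides x 8000 possible 3-grams (20^3),
# with at most 8 non-zero cells per peptide (made-up data).
set.seed(1)
n_row <- 200
n_col <- 8000
dense <- matrix(0L, nrow = n_row, ncol = n_col)
nz <- cbind(rep(seq_len(n_row), each = 8),
            sample(n_col, n_row * 8, replace = TRUE))
dense[nz] <- 1L

# Triplet form keeps only the positions and values of non-zero cells.
idx <- which(dense != 0L, arr.ind = TRUE)
triplet <- list(i = idx[, "row"], j = idx[, "col"], v = dense[idx],
                nrow = n_row, ncol = n_col)

# Rebuild the dense matrix to confirm that nothing was lost.
rebuilt <- matrix(0L, nrow = triplet$nrow, ncol = triplet$ncol)
rebuilt[cbind(triplet$i, triplet$j)] <- triplet$v
identical(rebuilt, dense)

# Compare memory footprints in bytes.
dense_bytes <- as.numeric(object.size(dense))
triplet_bytes <- as.numeric(object.size(triplet))
c(dense = dense_bytes, triplet = triplet_bytes)
```

The dense footprint grows with the full number of cells, while the triplet footprint grows only with the number of non-zero counts.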

Reduced alphabets

Reduced alphabets:

  • amino acids are grouped into larger sets on the basis of specific criteria,
  • easier prediction of structures (Murphy, Wallqvist, and Levy 2000),
  • creation of more generalised models.

The following peptides appear to be completely different in terms of amino acid composition.


Peptide I:

FKVWPDHGSG


Peptide II:

YMCIYRAQTN

Group Amino acids
1 C, I, L, K, M, F, P, W, Y, V
2 A, D, E, G, H, N, Q, R, S, T





Peptide I:   FKVWPDHGSG  ->  1111122222

Peptide II:  YMCIYRAQTN  ->  1111122222
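The encoding of both peptides with the two-group alphabet can be checked in base R. This is a small sketch: `encode_peptide` is a hypothetical helper written for this example, and `chartr()` is just one convenient way to do the substitution.

```r
# Two-group reduced alphabet from the table above.
groups <- c(`1` = "CILKMFPWYV", `2` = "ADEGHNQRST")

# Encode a peptide by replacing each residue with its group label.
# `encode_peptide` is a hypothetical helper written for this example.
encode_peptide <- function(peptide, groups) {
  old <- paste0(groups, collapse = "")
  new <- paste0(rep(names(groups), nchar(groups)), collapse = "")
  chartr(old, new, peptide)
}

encode_peptide("FKVWPDHGSG", groups)  # "1111122222"
encode_peptide("YMCIYRAQTN", groups)  # "1111122222"
```

Although the two peptides share no residue at any position, they become identical after the reduction.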

Amyloid prediction

Amyloids

Amyloid aggregates are found in tissues of people suffering from neurodegenerative disorders such as Alzheimer’s disease, Parkinson’s disease and many other diseases.

Amyloid aggregates (red) around neurons (green). Strittmatter Laboratory, Yale University.

Amyloids

Source: National Institute on Aging (NIA) | National Institutes of Health (NIH)

Amyloid proteins

Peptide sequences with amyloidogenic properties are responsible for the aggregation of amyloidogenic proteins (hot spots):

  • short (6-15 amino acids),
  • very variable, usually hydrophobic amino acid composition,
  • create unique \(\beta\)-structures.

(Sawaya et al. 2007)

AmyloGram: n-gram-based amyloid prediction tool




Burdukiewicz, M., Sobczyk, P., Rödiger, S., Duda-Madej, A., Mackiewicz, P., and Kotulska, M. (2017). Amyloidogenic motifs revealed by n-gram analysis. Scientific Reports 7, 12961

QuiPT

Quick Permutation Test is a fast alternative to permutation tests for n-gram data. It also allows precise estimation of p-value.

QuiPT is available as part of the biogram R package.
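For context, the slow baseline that QuiPT is meant to replace is an ordinary permutation test run separately for every n-gram. The sketch below is such a baseline in base R and is not the QuiPT algorithm. The test statistic (absolute difference in n-gram frequency between classes), the function name `permutation_test` and all data are assumptions made for illustration.

```r
# Classic permutation test for one binary n-gram feature (the slow
# baseline). The statistic is the absolute difference in n-gram
# frequency between classes, chosen only for illustration.
permutation_test <- function(feature, target, n_perm = 2000) {
  stat <- function(f, t) abs(mean(f[t == 1]) - mean(f[t == 0]))
  observed <- stat(feature, target)
  permuted <- replicate(n_perm, stat(feature, sample(target)))
  # Add-one correction keeps the estimated p-value above zero.
  (sum(permuted >= observed) + 1) / (n_perm + 1)
}

set.seed(42)
target <- rep(c(1, 0), each = 50)   # 1 = amyloid (made-up labels)
informative <- rbinom(100, 1, ifelse(target == 1, 0.7, 0.2))
noise <- rbinom(100, 1, 0.4)

p_informative <- permutation_test(informative, target)
p_noise <- permutation_test(noise, target)
c(informative = p_informative, noise = p_noise)
```

With `n_perm` permutations the smallest p-value this approach can report is 1 / (n_perm + 1), and the loop has to be repeated for every n-gram, which is why a fast test with precise p-values matters for large n-gram spaces.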

ranger: A Fast Implementation of Random Forests

Package                Runtime [h]                               Memory usage [GB]
                       mtry=5000   mtry=15,000   mtry=135,000
randomForest           101.24      116.15        248.60          39.05
randomForest (MC)      32.10       53.84         110.85          105.77
bigrf                  NA          NA            NA              NA
randomForestSRC        1.27        3.16          14.55           46.82
Random Jungle          1.51        3.60          12.83           0.40
Rborist                NA          NA            NA              >128
ranger                 0.56        1.05          4.58            11.26
ranger (save.memory)   0.93        2.39          11.15           0.24
ranger (GWAS mode)     0.23        0.51          2.32            0.23

Runtime and memory usage for the analysis of a simulated dataset mimicking a genome-wide association study (GWAS). NA values indicate unsuccessful analyses: bigrf without disk caching failed because of memory shortage for all mtry values and numbers of CPU cores. With disk caching, we stopped bigrf after 16 days of computation.

Wright, M. N., and Ziegler, A. (2017). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software 77(1), 1-17

Cross-validation


Standard reduced alphabets

Do standard reduced alphabets, developed for other biological problems, improve amyloid prediction?

Standard reduced amino acid alphabets do not improve the quality of amyloid prediction.

Novel reduced amino acid alphabets

17 measures handpicked from the AAIndex database:

  • size of residues,
  • hydrophobicity,
  • solvent surface area,
  • frequency in \(\beta\)-sheets,
  • contactivity.

524 284 reduced amino acid alphabets with different levels of alphabet reduction (three to six amino acid groups).

Standard reduced alphabet

Hinges of the boxes correspond to the first and third quartiles. The bar inside each box represents the median. The gray circles correspond to the reduced alphabets with an AUC outside the 0.95 confidence interval.

Selection of best-performing reduced alphabet

For each category the alphabets have been ranked (rank 1 for the best AUC, etc.).

The best alphabet was the one with the lowest rank sum.
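The rank-sum selection can be sketched in a few lines of base R. The alphabet names, category names and AUC values below are made up for illustration and do not come from the study.

```r
# Hypothetical AUC values: rows are reduced alphabets, columns are
# categories (all names and numbers are made up).
auc <- matrix(c(0.80, 0.85, 0.78,
                0.83, 0.84, 0.82,
                0.79, 0.88, 0.75),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("alph_A", "alph_B", "alph_C"),
                              c("cat_1", "cat_2", "cat_3")))

# Rank within each category: rank 1 goes to the highest AUC.
ranks <- apply(auc, 2, function(x) rank(-x, ties.method = "min"))

# Sum the ranks across categories and pick the lowest sum.
rank_sum <- setNames(rowSums(ranks), rownames(auc))
best <- names(which.min(rank_sum))

rank_sum  # alph_A = 6, alph_B = 5, alph_C = 7
best      # "alph_B"
```

Note that `alph_C` has the single highest AUC (0.88 in `cat_2`) but is not selected, because the rank sum rewards alphabets that perform consistently across categories.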

Best-performing reduced alphabet


Groups 3 and 4: hydrophobic amino acids.

Group 2: amino acids disrupting the \(\beta\)-structure (\(\beta\)-breakers).

Alphabet similarity and quality of prediction

Is the best-performing reduced amino acid alphabet associated with amyloidogenicity?

Similarity index

Similarity index (Stephenson and Freeland 2013) measures the similarity between two reduced alphabets (1: identical alphabets, 0: completely dissimilar alphabets).

The correlation between the similarity index and the average AUC is significant (\(\textrm{p-value} \leq 2.2 \times 10^{-16}\); \(\rho = 0.51\)).

Are informative n-grams found by QuiPT associated with amyloidogenicity?

Informative n-grams

Of the 65 most informative n-grams, 15 (23%) are also present in amino acid motifs found experimentally (Paz and Serrano 2004).

Benchmark results

AmyloGram, the classifier trained using the best reduced alphabet, was compared with other amyloid prediction tools using an external dataset.

MCC (Matthews correlation coefficient) measures the performance of a classifier (1: the classifier always recognizes amyloid proteins correctly, -1: the classifier never recognizes them correctly).
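MCC is computed from the confusion matrix as \((TP \cdot TN - FP \cdot FN) / \sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}\). The base R sketch below implements this formula. The helper name `mcc` and the labels are made up for illustration, and this is not the code used in the benchmark.

```r
# Matthews correlation coefficient from binary labels (1 = amyloid).
# `mcc` is a hypothetical helper; the labels below are made up.
mcc <- function(truth, predicted) {
  tp <- sum(truth == 1 & predicted == 1)
  tn <- sum(truth == 0 & predicted == 0)
  fp <- sum(truth == 0 & predicted == 1)
  fn <- sum(truth == 1 & predicted == 0)
  denom <- sqrt(as.numeric(tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
  # By convention MCC is set to 0 when any marginal count is zero.
  if (denom == 0) return(0)
  (tp * tn - fp * fn) / denom
}

truth <- c(1, 1, 1, 0, 0, 0)
mcc(truth, truth)                   # 1: every prediction correct
mcc(truth, 1 - truth)               # -1: every prediction wrong
mcc(truth, c(1, 1, 0, 0, 0, 1))     # 1/3: two of three right per class
```

Unlike plain accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when amyloid and non-amyloid sequences are unevenly represented.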


Experimental validation

New amyloid

A new functional amyloid produced by Methanospirillum sp. (Christensen et al. 2018) was selected for analysis by AmyloGram.

Shiny application

AmyloGram web server

Summary

Web servers:

R packages:

Acknowledgements

  • Michał Burdukiewicz (Warsaw University of Technology).
  • Małgorzata Kotulska (Wrocław University of Science and Technology).
  • Stefan Rödiger (Brandenburg University of Technology Cottbus-Senftenberg).
  • Paweł Mackiewicz (University of Wrocław).
  • Piotr Sobczyk (Wrocław University of Science and Technology).

Funding:

  • Polish National Science Centre (2015/17/N/NZ2/01845 and 2017/24/T/NZ2/00003).
  • COST ACTION CA15110 (Harmonising standardisation strategies to increase efficiency and competitiveness of European life-science research).
  • KNOW Wrocław Center for Biotechnology.
  • German Federal Ministry of Education and Research (InnoProfile-Transfer-Projekt 03IPT611X).

Burdukiewicz, Michał, Piotr Sobczyk, Stefan Rödiger, Anna Duda-Madej, Paweł Mackiewicz, and Małgorzata Kotulska. 2016. “Prediction of Amyloidogenicity Based on the N-Gram Analysis.” e2390v1. PeerJ Preprints. https://peerj.com/preprints/2390.

Burdukiewicz, Michał, Piotr Sobczyk, Stefan Rödiger, Anna Duda-Madej, Paweł Mackiewicz, and Małgorzata Kotulska. 2017. “Amyloidogenic Motifs Revealed by N-Gram Analysis.” Scientific Reports 7 (1): 12961. doi:10.1038/s41598-017-13210-9.

Christensen, Line Friis Bakmann, Lonnie Maria Hansen, Kai Finster, Gunna Christiansen, Per Halkjær Nielsen, Daniel Erik Otzen, and Morten Simonsen Dueholm. 2018. “The Sheaths of Methanospirillum Are Made of a New Type of Amyloid Protein.” Frontiers in Microbiology 9: 2729. doi:10.3389/fmicb.2018.02729.

Murphy, Lynne Reed, Anders Wallqvist, and Ronald M. Levy. 2000. “Simplified Amino Acid Alphabets for Protein Fold Recognition and Implications for Folding.” Protein Engineering 13 (3): 149–52. doi:10.1093/protein/13.3.149.

Paz, Manuela López de la, and Luis Serrano. 2004. “Sequence Determinants of Amyloid Fibril Formation.” Proceedings of the National Academy of Sciences 101 (1): 87–92. doi:10.1073/pnas.2634884100.

Sawaya, Michael R., Shilpa Sambashivan, Rebecca Nelson, Magdalena I. Ivanova, Stuart A. Sievers, Marcin I. Apostol, Michael J. Thompson, et al. 2007. “Atomic Structures of Amyloid Cross-β Spines Reveal Varied Steric Zippers.” Nature 447 (7143): 453–57. doi:10.1038/nature05695.

Stephenson, James D., and Stephen J. Freeland. 2013. “Unearthing the Root of Amino Acid Similarity.” Journal of Molecular Evolution 77 (4): 159–69. doi:10.1007/s00239-013-9565-0.